Objective
To showcase the minimum number of steps
required to do tertiary analysis of DNA + Protein
and some of the different ways to look at the data
Major questions answered:
Things not shown:
All available methods eg. Filtering of nearby variants, variant annotation, plots
Discussing all methods and their options - Documented here
Systemic variations seen in protein data
H5 files are a replacement of loom files.
Where to get them?
These are part of the DNA and protein pipeline output
The sample h5 used in this workflow can be found here
Note: This is a h5 file trimmed specifically for this analysis
import missionbio.mosaic.io as mio
h5path = './sample.h5'
sample = mio.load(h5path, raw=False)
Dna, Cnv, and Protein are sub classes of the _Assay class
The information is stored in four ways, and the user
can change each of those
1. metadata (add_metadata / del_metadata):
dictionary containing metrics of the assay
2. row_attrs (add_row_attr / del_row_attr):
dictionary which contains 'barcode' as one of
the keys. All the values must be of the same
length i.e. match the number of barcodes
This is the attribute where 'label', 'pca',
and 'umap' values are added
3. col_attrs (add_col_attr / del_col_attr):
dictionary which contains 'ids' as one of
the keys. All the values must be of the same
length i.e. match the number ids
'ids' contains variants for DNA assays
and anitobides for Protein assays
4. layers (add_layer / del_layer):
dictionary containing 'read_counts' as one of
the metrics. All the values have the shape
(num barcodes) x (num ids). This is the attribute
where 'normalized_counts' will be added
sample.protein
sample.protein.metadata
sample.protein.row_attrs
sample.protein.ids()
sample.dna.layers
Topcis covered
Many filtering options are available
use the documentation shared earlier,
or the help() function to get the same
information here
help(sample.dna.filter_variants)
# Filter variants
# This is the default insights filtering method
dna_vars = sample.dna.filter_variants()
dna_vars
# Check the number of filtered variants
len(dna_vars)
Simply appnding the whitelist to the list of filtered
variants is sufficient to then select the variants
using the slice notation
i.e. sample.dna[{list of barcodes}, {list of ids}]
whitelist = ['chr1:115256513:G/A', 'chr21:44514718:C/T']
final_vars = whitelist + list(dna_vars)
len(final_vars)
# Selecting all cells and final variants
sample.dna = sample.dna[sample.dna.barcodes(), final_vars]
# Check the shape i.e. (Number of barcodes, number of ids)
# of the final filtered dna object
sample.dna.shape
Heatmaps are interactive. Clicking on it selects
the corresponding id whose value is stored in the
`selected_ids` attribute of the object
eg. sample.dna.selected_ids
sample.dna.stripplot(attribute='AF', colorby='GQ')
sample.dna.heatmap(attribute='AF')
sample.dna.selected_ids
sample.dna = sample.dna.drop(sample.dna.selected_ids)
DNA has a custom clustering method called `find_clones`
It projects the data on a UMAP and then performs
dbscan to identify unique clusters, which are then
merged in case they were formed due to missing
information
sample.dna.find_clones()
sample.dna.row_attrs
sample.dna.scatterplot(attribute='umap', colorby='label')
# AF_MISSING is the same as the AF layer except that it stores the missing values as -50 instead of 0
sample.dna.heatmap('AF_MISSING')
1. Basic filtering of barcodes ids demonstrated
2. Basic DNA filtering functionality showcased
Preliminary heatmap of CNV shows that there could be two clusters
Topics covered
sample.cnv.normalize_reads()
sample.cnv.heatmap(attribute='normalized_counts')
Here the UMAP options are kept constant
The only parameter in PCA is the number of components
Here we see how to determine this value, and the effect
when we deviate from this value
sample.cnv.run_pca(attribute='normalized_counts', components=6, show_plot=True)
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)
# Too many PCA components
sample.cnv.run_pca(PCA_components=15, layer='noermalized_counts')
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)
# Too few PCA components
sample.cnv.run_pca(PCA_components=3, layer='normalized_counts')
sample.cnv.run_umap(attribute='pca', min_dist=0, n_neighbors=100)
The result of the dimension reduction analysis is
visualized using a scatterplot of the umap
sample.cnv.cluster(attribute='umap', method='dbscan', eps=0.55)
sample.cnv.scatterplot(attribute='umap', colorby='label')
Given all other variables are kept constant
1. Too many PCA components may result in merging of clusters
2. Too few PCA component may result in splitting of clusters
3. The appropriate number of components can be determined using the elbow plot
Topics covered
# Downsampling and clustering similar to CNV
sample.protein.normalize_reads('CLR')
sample.protein.run_pca(attribute='normalized_counts', components=5)
sample.protein.run_umap(attribute='pca')
sample.protein.cluster(attribute='pca', method='graph-community', k=100)
sample.protein.heatmap(attribute='normalized_counts')
sample.protein.scatterplot(attribute='umap', colorby='label')
# Re cluster based on the observations from the UMAP
sample.protein.cluster(attribute='umap', method='dbscan')
# Prefered way to look at protein expression profiles
# In case of an error, make sure that ids have been selected on the heatmap and shown in sample.protein.selected_ids
sample.protein.ridgeplot(attribute='normalized_counts',
splitby='label',
features=sample.protein.selected_ids)
# UMAP with the expression for each of the selected protein overlayed
# In case of error, make sure that ids have been selected on the heatmap and shown in sample.protein.selected_ids
sample.protein.scatterplot(attribute='umap',
colorby='normalized_counts',
features=['CD34', 'CD44', 'HLA-DR'])
When `colorby` is not provided for any scatterplot
the lasso tool can be used to cluster the cells
based on the selection made
# Selction on biaxial scatterplot
# The same can be done for the UMAP when labels=False is passed
sample.protein.feature_scatter(layer='normalized_counts',
ids=['CD90', 'CD3'])
# Set the labels based on the selection in the above plot
sample.protein.set_selected_labels()
If someone is interested in trying their methods,
they can simply modify the appropriate layers, attributes
and metadata to plugin their step in this workflow
# Custom normalization by changing the `normalized_counts` layer
import numpy as np
log_reads = np.log10(10 + sample.protein.layers['read_counts'])
norm = np.divide(log_reads, log_reads.mean(axis=1).reshape(-1, 1))
sample.protein.add_layer('normalized_counts', norm)
Other examples include:
custom labels -> 'label' row_attr
custom palette -> 'palette' metadata
1. Protein analysis workflow similar to CNV
2. Different clustering methods can result in
different types of clusters being identified
3. It is possible to have custom clustering for
any scatterplot by using the lasso tool
4. Custom analysis is possible by modifying appropriate
layers, attributes and metadata
The significane of differential expression
based on a t-test can be looked at using
the `feature_signature` method
med, std, pval, tstat = sample.protein.feature_signature(layer='normalized_counts')
pval
pval = pval + 10 ** -50 + pval
pvals = -np.log10(pval) * (tstat > 0)
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 10))
sns.heatmap(pvals.T, vmax=50, vmin=2)
Conclusion
Statistical significance of the differential expression
can be ascertained. Median values can be explored for DNA
to determine the difference between clusters.
Visualization for multiple assays at once
sample.clone_vs_analyte('cnv')
sample.clone_vs_analyte('protein')
# Filtering protein and cnv to improve the visualization
sample.protein = sample.protein[:, ['CD3', 'CD90']]
sample.cnv = sample.cnv[:, 58:85]
sample.clone_vs_analyte('protein')
# Certain clones can also be dropped, but they must be dropped from all assays
# Hence the sample object is sliced in this case
# In this case it is better to store the new sample in a separate variable
# This returns the dna barcodes with the given labels
select_bars = sample.dna.barcodes(['2', '3', '4'])
sample_subset = sample[select_bars]
sample_subset.clone_vs_analyte('protein')
# The ids can also be reset to the entire set
sample.reset('cnv')
sample.reset('protein')
sample.clone_vs_analyte('protein')
sample.heatmap(clusterby='dna', sortby='protein', drop='cnv', flatten=False)
# Try the following
# sample.heatmap(clusterby='dna', sortby='protein', drop='cnv', flatten=True)
# sample.heatmap(clusterby='protein', sortby='dna', drop='cnv', flatten=False)
# sample.heatmap(clusterby='dna', sortby='protein', flatten=False)
The analysis can be saved to an h5 file.
This final trimmed file will be much smaller than the original h5 file.
It can be opened in Insights, or back again in Mosaic
mio.save(sample, './sample.analyzed.h5')
Data from h5 files can be efficiently manipulated,
visualized, and inferred using Mosaic.